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Abstract —Cost-efficient compressive sensing is challenging 
when facing large-scale data, i.e., data with large sizes. Con¬ 
ventional compressive sensing methods for large-scale data will 
suffer from low computational efficiency and massive memory 
storage. In this paper, we revisit well-known solvers called greedy 
algorithms, including Orthogonal Matching Pursuit (OMP), Sub¬ 
space Pursuit (SP), Orthogonal Matching Pursuit with Replace¬ 
ment (OMPR). Generally, these approaches are conducted by 
iteratively executing two main steps: 1) support detection and 2) 
solving least square problem. 

To reduce the cost of Step 1, it is not hard to employ the sensing 
matrix that can be implemented by operator-based strategy 
instead of matrix-based one and can be speeded by fast Fourier 
Transform (FFT). Step 2, however, requires maintaining and 
calculating a pseudo-inverse of a sub-matrix, which is random 
and not structural, and, thus, operator-based matrix does not 
work. To overcome this difficulty, instead of solving Step 2 
by a closed-form solution, we propose a fast and cost-effective 
least square solver, which combines a Conjugate Gradient (CG) 
method with our proposed weighted least square problem to 
iteratively approximate the ground truth yielded by a greedy 
algorithm. Extensive simulations and theoretical analysis vali¬ 
date that the proposed method is cost-efficient and is readily 
incorporated with the existing greedy algorithms to remarkably 
improve the performance for large-scale problems. 

Index Terms —Compressed/Compressive sensing. Greedy algo¬ 
rithm, Large-scale Data, Least square. Sparsity 

I. Introduction 

In this section, we first briefly introduce the background 
of compressive sensing (CS) in Sec. II-AI Then, the existing 
CS recovery algorithms (for large-scale signals) are discussed 
in Sec. II-BI Finally, the overview and contributions of our 
proposed method are described in Sec. II-CI followed by the 
organization of the remainder of this paper. 

A. Background 

Compressive sensing (CS) iiiiaE for sparse signals in 
achieving simultaneous data acquisition and compression has 
been extensively studied in the literature. CS is recognized to 
be composed of fast encoder and slow decoder. 

Let X S denote a iF-sparse 1-D signal to be sensed, let 

fl) G ^mxn ^ represent a sampling matrix, and let 
y G be the measurement vector. At the encoder, a signal 
X is simultaneously sensed and compressed via fl) to obtain a 
so-called measurement vector y as: 

y = <^x, ( 1 ) 
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which is usually called a procedure of ranfom projection. The 
measurement rate, defined as 0 < ^ < 1, indicates the com¬ 
pression ratio (without quantization) and is a major concern 
in many applications. mil show that it is a good choice to 
design $ as a Gaussian random matrix to satisfy either mutual 
incoherence property (MIP) or restricted isometry property 
(RIP). Moreover, sparsity is an inherent assumption made in 
compressed sensing to solve the underdetermined system in 
Eq. m due to M < N. Nevertheless, for real applications, 
natural signals are often not sparse in either the time or space 
domain but can be sparsely represented in a transform {e.g., 
discrete cosine transform (DCT) or wavelet) domain. Namely, 
cc = fl's, where is a transform basis (or dictionary) and s is 
a sparse representation with respect to flr. So, Eq. O is also 
rewritten as: 

y = = As, (2) 

where A = G We say that x is AT-sparse if s 

contains only K non-zero entries (exactly AT-sparse) or K 
significant components (approximately AT-sparse). 

At the decoder, the original signal x can be perfectly 
recovered by an intuitive solution to CS recovery, called 
minimization, which is defined as: 

min||s||o s.t. ||y-As|| 2 <e, (3) 

S 

where e is a tolerable error term. Due to M < N, this 
system is underdetermined and there exists infinite solutions. 
Thus, solving fg-minimization problem requires combinatorial 
search and is NP-hard. 

Alternative solutions to Eq. Q usually are based on two 
strategies: convex programming and greedy algorithms. Eor 
convex programming, researchers IT] |3l have shown that when 
M > O(Ariog^) holds, solving £g-minimization is equiva¬ 
lent to solving -minimization, defined as: 

min||s||i s.t. \\y - As \\2 < e. (4) 

S 

Typical -minimization models include Basis Pursuit (BP) ||4| 
and Basis Pursuit De-Noising (BPDN) 0 with the computa¬ 
tional complexity of recovery being polynomial. Eor greedy 
approaches, including Orthogonal Matching Pursuit (OMP) 
E), CoSaMP Q, and Subspace pursuit (SP) 0, utilize a 
greedy strategy for support detection first and then solve a 
least square problem to recover the original signal. The main 
difference among these greedy algorithms is how the support 
detection step is condcuted. 
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Nevertheless, for large-scale signals (e.g., N > 2^°), both 

-minimization and greedy algorithms suffer from high com¬ 
putational complexity and massive memory usage. Ideally, 
the memory costs at the encoder and decoder are expected 
to approximate 0{M) and 0{N), respectively, which are 
the minimum costs to store the original measurement vector 
y and signal x. If A is required to be stored completely, 
however, it will cost 0{MN) bytes (e.g., when N = 
and M — 2^®, the sensing matrix needs several terabytes). It 
often overwhelms the capability of existing hardware devices. 

In view of the incoming big data era, such a troublesome 
problem needs to pay immediate attention. In this paper, 
we say that X € ^NixN 2 x — xNd ^ Qf 

if iVi X 7V2 X • • • X Nd is lare enough or more specifically 
its size approaches the storage limit of hardwares like PC, 
notebook, and so on. 

B. Related Work 

The existing methods that can deal with compressive sens¬ 
ing of large-scale signals are discussed in this section. As 
mentioned above, we focus on computational efficiency and 
memory usage. Basically, our survey is conducted from the 
aspects of encoder and decoder in compressive sensing. We 
mainly discuss block-based, tensor-based, and operator-based 
compressive sensing algorithms here. 

1) Strategies at Encoder: From Eq. O, we can see that 
both the storage (for T') and computation (for vPa:) costs 
require 0{MN) bytes and 0{MN) operations, respectively. 
When the signal length becomes large enough, storing <!> and 
computing become an obstacle. 

In the literature, Gan ifTTIl and Mun and Fowler ifT^ propose 
block-based compressive sensing techniques, wherein a large- 
scale signal is separated into several small block signals, 
which are individually sensed via the same but smaller sensing 
matrix. The structure of block sensing reduces both storage 
and computation costs to where B is the number 

of blocks. Although block-based compressive sensing can 
deal with small blocks quickly and easily, it actually cannot 
work for the scenario of medical imaging in that an image 
generated from the fast Fourier Transform (FFT) coefficients 
of an entire sectional view lITSl violates the structure of block- 
based sensing. 

Shi et al. im and Caiafa and Cichocki El consider the 
problem of large-scale compressive sensing based on tensors. 
In other words, the signal is directly sensed and reconstructed 
in the original (high) dimensional space instead of reshaping 
to 1-D. For example, a 2-D image X € ]^>/ivxvW sensed 
via 

F = (5) 

where $1 and $2 £ and Y G j^iis 

strategy is often called separable sensing ll20l . ll2Tll . In this 
case, both the storage and computation costs are reduced to 
0{\/MN). ifTOl further presents a close-from solution for re¬ 
construction from compressive sensing based on assuming the 
low-rank structure. It should be noted that since tensor-based 
approaches, in fact, change the classical sensing structure (i.e., 
y = $a;) of CS, the decoder no longer follows the conventional 


solvers like Eq. (|4]i. Specifically, the measurements in tensor- 
based approaches form a tensor but conventional solvers only 
accept one-dimensional measurement vector. 

In addition to block-based and tensor-based approaches, 
operator-based approaches are to design $ as a deterministic 
matrix or structurally random matrix, implemented by certain 
fast operators. For example, Candes et al. Il22l propose the 
use of a randomly-partial Fourier matrix as $. In this case, 
we can implement ^x by V (FFT(x)), where FFT(-) is the 
function of fast Fourier transform (FFT) and V {■) denotes 
a downsampling operator that outputs an M x 1 vector. 
Thus, <!> is not necessarily stored in advance. In addition, the 
computation cost also becomes 0{N log N), which especially 
outperforms 0{MN) for large-scale signals because M is 
positively proportional to N. Do et al. Il23ll further propose 
a kind of random Gaussian-like matrices, called Structurally 
Random Matrix (SRM), which benefits from operator-based 
strategy and achieves reconstruction performance as good 
as random Gaussian matrix. In sum, since operator-based 
approaches follow the original CS structure, the decoder is 
not necessary to be modified. 

2) Strategies at Decoder: For block-based approaches 
MM, each block can be individually recovered with low 
computation cost and memory usage but incurs blocky effects 
between boundaries of blocks. In na, Mun and Fowler pro¬ 
pose a method, called BCS-SPF, which further removes blocky 
effects by Wiener filtering. In addition to the incapabliity 
of sensing medical images like MRI, BCS-SPF is also not 
adaptive in that the measurement rates are fixed for different 
blocks by ignoring the potential differences in smoothing 
blocks that need less measurement rates and complex blocks 
that require more measurement rates. 

For tensor-based compressive sensing, El develops a new 
solver called N-way block OMP (N-BOMP). Though N- 
BOMP is indeed faster than conventional CS solvers, its 
performance closely depends on the unique sparsity pattern, 
i.e., block sparsity, of an image. Specifically, block sparsity 
states that the importnat components of an images are clustered 
together in blocks. This characteristic seems to only naturally 
appear in hyperspectral imaging. In ll24ll . a multiway compres¬ 
sive sensing (MWCS) method for sparse and low-rank tensors 
is proposed. MWCS achieves more efficient reconstruction, 
but its performance relies heavily on tensor rank estimation, 
which is NP-hard. A generalized tensor compressive sensing 
(GTCS) method 1251 . which combines £i-minimization with 
high-order tensors, is beneficial for parallel computation. 

For operator-based compressive sensing algorithms, since 
the conventional solvers, mentioned in the previous subsection, 
still can be used, here we mainly review state-of-the-art convex 
optimization algorithms focusing on the large-scale problem, 
where only simple operations such as A and conducted 
by operator are required. 

Cevher et al. Il26l point out that an optimization algorithm 
based on the first-order method such as gradient descent 
features nearly dimension-independent convergence rate and 
is theoretically robust to the approximations of their oracles. 
Moreover, the first-order method such as NESTA Et) often 
involves the transpose of sensing matrix, which is easily 
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TABLE I 

Comparison between each algorithm. 


algorithms 

sensing strategy 

assumptions 

algorithm type 

storage 

N-BOMP (9) 

tensor-based (2D) 

block sparsity 

greedy 

0{VMN) 

Qo) 

tensor-based (2D) 

low multilinear-rank 

closed-form 

0{VMN) 

BCS QI] 

block-based (ID) 

- 

Landweber-based 

O(^) 

BCS-SPL 1121 

block-based (ID) 

- 

Landweber-based 

o(#) 

[13] 

conventional (ID) 

- 

Armijo-based 

0(MN) 

GPSR Il4l. SpaRSA (isP 

conventional (ID) 

- 

IST-based 

0{MN) 

Giin] 

conventional (ID) 

- 

kFC_AS-based 

0(MN) 


implemented by operator. Both GPSR m and SpaRSA 
ifTSl are closely related to iterative shrinkage/threshold (1ST) 
methods and support the operator-based strategy. In addition, 
nsuniEsi have shown that algorithms based on solving 
fixed-point equation have fast convergence rate, which can 
be combined with operator-based strategy too. For example, 
Milzarekef al. ifT^ further propose a globalized semismooth 
Newton method, where partial DCT matrix is adopted as 
the sensing matrix for fast sensing. But, it requires that 
signals are sparse in the time/spatial domain leading to limited 
applications. 

3) Brief Summary of Related Works: Table |T] depicts the 
comparisons among the aforementioned algorithms, where 
storage is estimated based on non-operator version. If operator 
can be used, the storage of storing a sensing matrix is not 
required and, thus, is bounded by 0{N) for each row vector, 
which is only related to the minimum requirement for storing 
the reconstructed signal. Since the characteristic of compres¬ 
sive sensing states that CS encoder spends lower memory 
and computation cost than CS decoder, when taking hardware 
implementation in real world into consideration, tensor-based 
methods are more complicated than others. For example, the 
single-pixel camera designed in ll29l uses a DMD array as 
a row of A to sense x. By changing the pattern of DMD 
array M times, the measurements are collected. This structure, 
however, cannot support separable sensing that is commonly 
used in tensor-based methods. Block-based CS methods do 
not intrinsically overcome large-scale problems and lack con¬ 
vincing theoretical proof about complexity, performance, and 
convergence analysis. Operator-based CS methods maintain 
the original structures of CS encoder and decoder. Thus, most 
of the existing fast algorithms for £i-minimization can be used 
only if all of matrix operations can be executed in an operator 
manner. Furthermore, they have strong theoretical validation 
since -minimization is a well-known model and has been 
developed for years. In fact, the operator-based strategy can 
also be employed in tensor-based and block-based compressive 
sensing methods to partially reduce their computation cost and 
storage usage. 

C. Contributions and Overview of Our Method 

Up to now, it is still unclear how greedy algorithms can deal 
with large-scale problems by utilizing operator-based strategy. 
Although SparseLab releases the OMP code combined with 
operator, the program still cannot deal with large-scale signals. 
This challenge is the objective of this paper and, to our 


knowledge, we are the first to explore this issue. In fact, our 
idea can help all greedy algorithms to deal with large-scale 
signals. We will discuss the problem in detail in Sec. |II] 

Generally, greedy algorithms are conducted by iteratively 
executing two main stages: (a) support detection and (b) 
solving least square problem with the known support. To 
reduce the cost of support detection, we follow the common 
strategy of adopting operator-based, instead of matrix-based, 
design of a sensing matrix {e.g., ED). Therefore, we no longer 
discuss this step in this paper as it is not the focus of our 
method. 

For solving the least square problem, we propose a fast 
and cost-effective solver by combining a Conjugate Gradient 
(CG) method with a weighted least square model to iteratively 
approximate the ground truth. In our method, the memory cost 
of solving least square problem is reduced to 0{N), and the 
computation cost of CG method is approximately 0{N log N) 
for finite floating point precision and 0{KN log N) for exact 
precision. 

In should be noted that although using CG to solve the least 
square problem is not new, our extended use of CG brings 
additional advantages. For example, Blumensath et al. El 
proposed “Gradient Pursuits (GP)” in which the memory cost 
is dominated by 0{MN) to save <I> (see Table 1 in |l30l), 
which cannot be stored explicitly for large-scale problem. Our 
method extends GP to reduce the memory cost by using SRM 
to avoid saving <I>. In addition, we reformulate a least square 
problem used in GP into a weighted one and show both models 
are equivalent. More specifically, solving weighted least square 
problem only requires $ and can benefit from fast computation 
of operator-based approaches (as in SRM). Traditional least 
square problem, however, involves sub-matrices of $ and 
cannot directly be conducted by fast operator. 

On the other hand, “iteratively reweighted least-square 
(IRLS)” was proposed in OTll . Though both IRLS and our 
method involve weighting, they are totally different. First, 
IRLS uses weighting to approximate ^i-norm solution instead 
of f 2 -norm solution in original least square problem while our 
greedy method still solves £ 2 -norm solution in the weighted 
least square problem. Second, IRLS is not a greedy method. 

Moreover, we conduct extensive simulations to demonstrate 
that our method can greatly improve OMP HD, Subspace 
pursuit (SP) E], and OMPR lf32l in terms of memory usage 
and computation cost for large-scale problems. 
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D. Outline of This Paper 

The rest of this paper is organized as follows. In Sec. |II] 
we describe the bottleneck of current greedy algorithms that 
is to solve the least square problem. The proposed idea of 
fast and cost-effective least square solver for speeding greedy 
approaches along with theoretical analysis is discussed in Sec. 
m In Sec. lYl extensive simulations are conducted to show 
that our method indeed can be readily incorporated with state- 
of-the-art greedy algorithms, including OMR, SP, and OMPR, 
to improve their performance in terms of the memory and 
computation costs. Finally, conclusions are drawn in Sec. lYl 

II. Problem Statement 

In this paper, without loss of generality, we focus on a 
signal G -^NixN 2 x — xNd^ where N = Ni and N = Ni x N 2 
are large enough with respect to D = 1 and D = 2. When 
D = 2, the signal usually is reshaped to 1-D form in the 
context of compressive sensing. Ideally, the memory cost of a 
compressive sensing algorithm should be 0{N), which is the 
minimum requirement for saving the original signal, x. The 
computational cost, however, depends on an algorithm itself. 
Since greedy algorithms share the same framework composed 
of support detection and solving least square problem, we shall 
focus on reducing the costs of these two procedures. 

In this section, we discuss the core of proposed fast and 
cost-effective greedy approach and take OMP as an example 
for subsequent explanations. We shall point out the dilemma 
in terms of memory cost and computation cost when handling 
large-scale signals. In fact, both costs suffer from solving the 
least square problem, which cannot be conducted by operator 
directly. 

First, we follow the notations mentioned in the previous 
section and briefly introduce OMP 16] in a step-by-step manner 
as follows. 

1) Initialize the residual measurement tq = y and initialize 
the set of selected supports So = {}. Let the initial 
iteration counter be i = 1. Let As be the sub-matrix of 
A, where As consists of the column of A with indices 
belonging to the support set S. is the transpose of A. 

2) Detect supports (or positions of significant components) 
by seeking maximum correlation from 

t = argmax | (A^r^-i)^ |, (6) 

and update the support set Si = S'i_i U {f}. 

3) Solve a least square problem Sj = (A^. AsJ^^A^.y, and 
update residual measurement Vi = y — As^s,. 

4) If i = K, stop; otherwise, i = i + \ and return to Step 2. 

Tropp and Gilbert ||6l derive that the computational com¬ 
plexity of OMP is bounded by Step 2 (support detection) 
with 0{MN) and Step 3 (solving least square problem) 
with 0{MK), and the memory cost is 0{MN) when A is 
executed in a matrix form. In this paper, we call it matrix- 
based OMP (M-OMP). As mentioned in Sec. II-BI A can be 
designed to be an SRM conducted by operator. Nevertheless, 
operator is only helpful for certain operations such as A and 
A^. For example, if A is a partial random Fourier matrix. 
Ax = 'D{FFT{x)) and A^y = IFFT(y) can be quickly 


calculated, where y = 0,..., 0]^ and IFFT{-) denotes 

N-M 

inverse FFT function. In this paper, we call it operator-based 
OMP (O-OMP). 

Unfortunately, the key is that (A^.AsJ”^ still can¬ 
not be quickly computed in terms of operator. Hence, 
it requires 0{KM) to store Ag^. Instead of calculating 
(Ag. AsJ”^ directly, by preserving the Cholesky factorization 
of (Ag^_^A 5 ^_ at the [i — 1)*^ iteration for subsequent 
use. Step 3 is accelerated and the memory cost is reduced 
to 0{K^). Moreover, it is worth mentioning that the sparsity 
K of natural signals is often linear to signal length N. For 
example, the number of significant DCT coefficients for an 
image usually ranges from O.OIW to OAN. Also CS has shown 
that M must be linear to K log N for successful recovery with 
high probability. Under the circumstance, 0{N) is equivalent 
to 0{K) and 0{M) in the sense of big-O notation. We can 
see that when N is large, 0{K^) dominates the memory cost 
because ^ N. Thus, Step 3 makes OMP infeasible for 
recovering large-scale signals. 

In fact, greedy algorithms share the same operations, i.e., 
A^Xi-i in Step 2 and (A^^AsJ“^Ag.j/ in Step 3, where 
the main difference is that the support set St is found 
by different ways, and face the same dilemma. A simple 
experiment is conducted and results are shown in Fig. [T] 
to illustrate the comparison of memory usage among M- 
OMP, O-OMP, and ideal cost (which is defined as Wx 8 
bytes required for Double data type in Matlab). The OMP 
code running in Matlab was downloaded from SparseLab 
(http://sparselab.stanford.edu). Obviously, though the memory 
cost of O-OMP is reduced without storing A, it still far higher 
than that of ideal cost. It is also observed that both M-OMP 
and O-OMP exhibit the same slope. Specifically, M-OMP and 
O-OMP cost 0{MN) and 0{K^), respectively. As mentioned 
before, since M, K are linear to N, it means 0{K^) = 0{N^) 
and 0{MN) = 0{N‘^) such that both orders of memory cost 
of M-OMP and O-OMP are the same and are larger than that 
of ideal case. Consequently, solving the least square problem 
becomes a bottleneck in greedy approaches. This challenging 
issue will be solved in this paper. 



Fig. 1. Memory cost comparison among the matrix-based OMP, operator- 
based OMP, and ideal cost under M = ^ and K = ^. 
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III. Proposed Method for Speeding Recovery of 
Greedy Algorithms 

In this section, we first introduce how to determine a 
sensing matrix, which can be easily implemented by operator. 
Then, we reformulate the least square problem as a weighted 
one, which is solved by conjugate gradient (CG) method 
to avoid involving the sub-matrices, Asi or Ag . We first 
prove that the solutions to the least square problem and its 
weighted counterpart are the same, and then prove that the 
solutions to the weighted least square problem and its CG- 
based counterpart are the same. 

A. Sensing Matrix 

A random Gaussian matrix is commonly used as the sensing 
matrix as it and any orthonomal basis can pair together to 
satisfy RIP and MIP in the context of compressive sensing. 
The use of random Gaussian matrix as the sensing matrix, 
however, leads to the overhead of storage and computation 
costs. Although storage consumption can be overcome by 
using a seed to generate a random Gaussian matrix, it still 
encounters high computational cost. 

In ||23]| . Do et al. propose a framework, called Structurally 
Random Matrix (SRM), defined as; 

$ = DFR, (7) 

where D G ^ sampling matrix, F G is an 

orthonormal matrix, and R G is a uniform random per¬ 

mutation matrix (randomizer). Since the distributions between 
a random Gaussian matrix and SRM’s T* are verified to be 
similar, we choose Eq. (|7]) as the sensing matrix for our use. 
It should be noted that D and R can inherently be replaced by 
operators but F depends on what kind of orthonormal basis is 
used. It is obvious that any fast transform can be adopted as F. 
In our paper, we set F to the Discrete Cosine Transform (DCT) 
due to its fast computation and cost-effectiveness. There are 
literatures discussing the design of sensing matrix but it is not 
the focus of our study here. 


B. Reformulating Least Square Problem: Weighted Least 
Square Problem 

In Sec. im we describe that the bottleneck of greedy al¬ 
gorithms is the least square problem. To solve the problem, 
it is reformulated as a weighted least square problem in our 
method. To do that, we first introduce a weighted matrix 
W G defined as: 


WsAjA] 




( 8 ) 


where, without loss of generality. Si = { 1 , 2 , ...,*} denotes a 
support set at the z-th iteration and WsAfj] the 
entry of Ws^- As can be seen in Eq. (| 8 ]l, the weighte matrix 
W is designed to make the supports unchanged. 

Then, we prove that both solutions of the least square 
problem and its weighted counterpart are the same. 


Theorem 1. Suppose the sub-matrix A 5 . G R^^*^ of A has 
full column rank with support set Si- Let Si G R^ be the 
solution to the least square problem: 

Si = argmin \\y - As,s|| 2 , (9) 

S 

and let 9i G R^ be the solution to the weighted least square 
problem: 

9i = argmin \\y - AWsM^- (10) 

e 


We have 


s^[j] = 


for 1 < J < z. 

, -1 


Proof Let = 


0 ]^ = 


{aIAsA' 

0 




z/ be a 


solution minimizing \\y — AWsi9A\2- Then (0* -|- uju G 
Null{AWsi)} is the solution set of Eq. (fTOl i. Since A 5 . has 
full column rank with rank{Asf) = K and Null{AWsi) = 
span (ci+i, ei+ 2 , Cat), where Ci is a standard basis. Thus, 
no matter what v is, the first z entries of 0 , -|- u are invariant. 
We complete the proof. □ 


C. Reformulating Weighted Least Square as CG-based 
Weighted Least Square Problem 


We can see from Eq. (fTol) that the introduce of weighted ma¬ 
trix involves AWsi instead of submatrix Asi so that A can be 
calculated by fast operator. Nonetheless, the closed-form solu¬ 


tion of Eq. (doll is 


0 




y and still faces the dif¬ 


ficulty in that {Ag. As ^) ^ Ag, cannot be easily implemented 
by operator. Instead of seeking closed-form solutions, we aim 
to explore the first-order methods (e.g., gradient descent), 
which have the following advantages: ( 1 ) the operations only 
involve A and instead of the pseudo-inverse of A, (2) the 
convergence rate is nearly dimension-independent ll26ll . and (3) 
if A is a sparse matrix, the computation cost can be further 
reduced. Conjugate gradient (CG) El El is a well-known 
first order method to numerically approximate the solution of 
symmetric, positive-definite or positive-semidefinite system of 
linear equations. Thus, CG benefits from the advantages of 
the first-order method. However, the matrix AWs^ in Eq. (fTOl) 
is not symmetric. Thus, we reformulate Eq. (fTOl) in terms of 
CG as follows. We will prove that both the solutions to the 
weighted least square problem and CG-based weighted least 
square problem are the same. 


Theorem 2. Suppose the sub-matrix A 5 . G R'^^^ 0 /A has 
full column rank with support set Si. Let 9i G R^ be the 
solution to weighted least square problem defined in Eq. (O- 
Let 9 G R'^ be the solution to the CG-based weighted least 
square problem reformulated from Eq. m as: 

9i = argmin IIlEj A^(y-AlTs,0)11 2 . (11) 

§ 


Then ^ 

^z[j] = Hj]^ for 

Proof. We hrst note that now Wg.A^AWsi symmetirc to 

'UA^ Aq A^' 

meet the requirement of CG. Since \ Si y is an 
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optimal solution for both Eq. ( fTOl i and Eq. (fTTI) . the solution 
set of Eq. (HUl can be expressed as: 


{ 


y + v\AWsiV G Null{Wg.A ^)}. 


It should be noted that NulliWg.A^) = Null{A'gJ. 
In addition, AWs^v G C{AWsi), where C{AWsi) de¬ 
notes the column space of AWs^ and C{AWsi) = 
C{Asi). Since C{Asi)C\Null{Ag,) = {0}, it implies 

AWsiV = 0. In other words, v G Null{AWsi)- Due to 
Null{AWsi) — span (ci+i, 6 ^+ 2 ,cat), the first i entries of 
{AlAsX^Al 


set of Eq. (Uni) i 


y-\-v are invariant. Similarly, the solution 


IS 


{ 


{AlAs,)-'Al 


0 


y -f v\v G Null{AWsi)}- 


The first i entries of 
ant. We complete the proof. 


i^sAsi) A'^. 


y + vaie also invari- 

□ 


So far, we prove that, with correct support detection, the op¬ 
timal solution to the CG-based weighted least square problem 
in Eq. (fTTI) is equivalent to that to the original least square 
problem in Eq. (|9l). In addition, the matrix Wg.AJ'AWs^ 
in Eq. (fTTI) is symmetric and can quickly be solved by CG 
method. Nevertheless, 0^ points out that the system of linear 
equations with a positive-semidefinite matrix diverges unless 
some conditions are satisfied. Thus, Theorem [3 further shows 
the condition of convergence. 

Theorem 3. /EH? If W^.A^y G CiW^.A'^AWs,), CG 
method converges but the solution is not unique. 


Now, we check whether the CG-based weighted least square 
problem converges. Again, let Si = {l,2,...,i} denote a 
support set. Then, we have Wg.A^y = [y'^Asi 0]^ and 

0 

Because A'gAsi C is full rank, the first i entries of 

Wg A^y must be spanned by the basis of AgAst- The 
remaining N — i entries are 0 and trivial. Thus, it implies 
the CG part of our CG-based weighted least square solver 
satisfies Theorem [3 and converges. 


CiW^A^AWs,) =C{ 


D. Speeding Orthogonal Matching Pursuit and Complexity 
Analysis 

In the previous section, we descirbe the proposed CG- 
based weighted least square solver. In this section, we first 
show that the CG-based solver can be implemented easily by 
operator. Then, we combine it with OMP as a new paradigm 
to achieve fast OMP. Moreover, we discuss the convergence 
rate of CG and derive the computation complexity of proposed 
operator-based OMP via CG (dubbed as CG-OMP). Einally, 
we conclude that the memory cost of CG-OMP achieves ideal 
0{N) if T' is also conducted by operator. 

Algorithm [T] describes the proposed CG-OMP method 
(Lines 01 - 10), which employs a CG technique (Lines 11 


Algorithm 1 Proposed Orthogonal Matching Pursuit 


Input: y. A, K; Output: sk'. 

Initialization: i = 1, ro = y,so = So = {}; 


01. function Proposed Operator-based OMP() 

02. for i = 1 to K 

03. t = argmaxj j(A^ri-i)(-|; 

04. Si = Si-1 + t- 

05. Assign Ws^ according to Eq. 

06. Si =CG{A,Wsi,y,i)', 

07. ri = y - Asi', 

07. i = i + 1\ 

08. end for 

09. Return 

10. end function 

11. function [ s ]=CG{A,Wsi,y,i) 


12 . 

13. 

14. 

15. 

16. 

17. 

18. 

19. 

20 . 
21 . 
22 . 
23. 


b = WfA^y, H = WsA^AWsp, 
do = ro = 6, s = 0; 

7 = 0; 

while(||r3 ||2 < 0 


Hdi ’ 

^3 J 

s = s exjdj 

G+i = rj - ajHdj- 

= :I±^; 

I J I r-r^ 

3 ^ 

dj+i = rj+i + Pj+idj; 

3=3 + L 

end while 
Return:s; 


24. end function 


- 23). It is worth mentioning that ^ in Line 15 controls the 
precision of CG method. If ^ = 0, it means exact precision 
such that the output of CG method is equal to least square 
solution. If ^ > 0, CG method attains finite precision. How¬ 
ever, the result with finite precision is not certainly worse than 
that with exact precision especially under noisy interference, 
as later discussed in the 4-th paragraph of Sec. IIV-CI 

Now we check whether all matrix operations can be imple¬ 
mented by operators in Algorithm [T] in the following. 

. Ws, G 

Wsi is a diagonal matrix. Thus, WsiX is equal to assign 
^[j] = 0 for j ^ Si. The memory cost is 0{K) to store 
the indices of support set and the computation cost is 

0{N). 

. ^ DFR&R^'^^-. 

Dx is equal to randomly choose M indices from N 
entries in x. Fx = DCT{x), where DCT can be speeded 
by PET. Rx is equal to randomly permute the indices 
of vector x. The memory cost is bounded by 0{N) in 
order to store the sequence of random permutation and 
the computation cost is 0{N\ogN). 

. T' G 

If 4/ is a deterministic matrix, it is not necessarily to 
be stored. Thus, the memory cost of ika: is 0{N). 
The computational cost depends on whether ika: can be 
speeded up. In the worst case, it costs 0{N‘^). 

Other operations such as rjrj (Line 16) only involve mul¬ 
tiplications between vectors. Thus, both memory cost and 
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computation cost are bounded by 0{N). 

From the above analysis, one can see that the computation 
cost of Algorithm [1] is mainly bounded by 'F. We discuss some 
applications below, where the computation cost involving '1' 
is low. For compressive sensing of images, wavelet transform 
is often chosen as 'P such that 'Pa; costs 0{N). For magnetic 
resonance imaging (MRl), partial Fourier transform is selected 
as the component F of the sensing matrix expressed as DFR. 
In this case, is / in order to satisfy MIP or RIP, and costs 
nothing. Spectrum sensing is another application, where 'P is 
a discrete Fourier transform matrix done with 0{N log N). 

Now, the total cost of Algorithm [T] is discussed. Both the 
computation and memory costs of Lines 1-10 except Line 6 
(CG method) will be 0{N log W) and 0{N), respectively. As 
for the memory cost of CG method, it needs 0{N). Therefore, 
the total memory cost of Algorithm [T] is bounded by 0{N). 

In addition, the computation cost is related to two factors, 
i.e., the number of iterations to converge in CG and 'P. They 
are further discussed as follows. 

Theorem 4. (Theorem 2.2.3 in 1 1351? ) Let FI be symmetric and 
positive-definite. Assume that there are exactly k < N distinct 
eigenvalues in iJ. Then, CG terminates in at most k iterations. 

In our case, iJ = Wg.AJ'AWsi is positive-semidefinite 
instead of positive-definite. Thus, we derive the following 
theorem. 

Theorem 5. Given b = Wg.A^y and FI = AWs^, 

solving Eq. (EZI requires the number of iterations at most K 
in CG, where K is the sparsity of an original signal. 

Proof. Without loss of generality, let support set Si = 
{1, 2,..., i}. We start from another optimization problem; 

Si = argmin||As (j/-A s,s)|| 2 . (12) 

S 

Following the same skill in Theorem |2] Si is a unique and 
optimal solution to both Eq. (|9]l and Eq. (fTSl i. Thus, Si is also 
the solution of Eq. for the first i entries. Eurthermore, 
AgAsi non-singular such that AgAsi ^ positive-definite 
matrix and has at most i distinct eigenvalues. Erom Theorem 
m solving Eq. (fTSli requires at most i iterations. Then, we 
have b = WgA'^y = [y'^As^ 0]^ and H = Wg^A'"AWs^ = 

in Eq. (HD- When we only take the first 

i entries of b and the left-top i x i submatrix of H into 
consideration, it is equivalent to solving Eq. ([nil. This fact can 
be checked trivially by comparing each step of CG for both 
optimization problems in Eq. (fTTTi and Eq. (fTSTi . Thus, the first 
i entries of s in Eq. (fTTli is updated in the same manner with 
that of s in Eq. (IT^ . The remaining N — i entries of s are 
unrelated to convergence because the N — i entries of iT s are 
zero. In sum, the required number of iterations to converge 
in Eq. (fTTTl is identical to that in Eq. (fT^ . Hence, solving 
Eq. (HB also requires at most i iterations. Since i < K, the 
number of iterations is at most K. We complete the proof. □ 

Moreover, for each iteration in CG, the operation, Hdj, 
dominates the whole computation cost. It is obvious that if 'P 
can be executed with 0(N log N) or even lower computation 


AlAs, 0 
0 0 


complexity, FIdj costs 0(N log N) since iJ involves T', 
which spends 0{N logN). Note that the total computation 
complexity of CG-OMP will be 0(NK'^ log N), where 
comes from the outer loop in OMP, which needs 0{K), and 
solving Eq. (HB that requires the number of iterations at most 
K in CG, as proved in Theorem|5] On the other hand, if, under 
the worst case, 'P costs 0(N‘^) operations, the computation 
complexity of CG-OMP will be 0{N‘^K'^). Eor applications 
that accept finite-precision accuracy instead of exact precision, 
CG ll^ requires fewer steps (< K) to achieve approximation. 
Under the circumstance, the complexity of CG-OMP nearly 
approximates 0(NKlogN) and 0{N‘^K) operations for 'P 
with complexity 0(N logN) and 0(N‘^), respectively. 

Consequently, a reformulation for solving a least square 
problem is proposed based on CG such that the new matching 
pursuit methodology can deal wtih large-scale signals quickly. 
It should be noted that, in the future, CG may be substituted 
with other first order methods that outperform CG. The 
proposed idea can also be readily applied to other greedy 
algorithms to enhance their performance. 

E. Strategies for Reducing the Cost of 'P 

We further consider that if 'P is a learned dictionary, it 
will become a bottleneck for operator-based algorithms since 
it requires 0{N‘^) for storage. To overcome this difficulty, 'P 
should be learned in a tensor structure. Let x = vec{X), where 
X £ ^CnxVn vecf) is a vectorization operator. That is, 
a two-dimensional vector is reshaped to a one-dimensional 
vector. Then, we can learn a 2D dictionary such that x = 
vps = r;ec('Pi5''P|’) with s = vec(S) and 'P = 'P 2 C) T'l (0 is 
a Kronecker product). Under the circumstance, all operations 
in CG involving can be replaced by <Pr;ec('Pi5''P|^). 

Moreover, both 'Pi and 'P 2 only require 0{N) in terms of 
memory cost. In the literature, the existing algorithms for 2D 
separable dictionary learning include 1191 lIMIl llJTl . 

IV. Experimental Results 

In this section, we conduct comparisons among O-OMP, 
M-OMP, and CG-OMP in terms of the memory cost and 
computation cost. The code of OMP was downloaded from 
SparseLab (http;//sparselab.stanford.edu). We have also ap¬ 
plied the proposed fast and cost-effective least square solver 
to SP and OMPR in order to verify if our idea can speed the 
family of matching pursuit algorithms. Eor SP and OMPR, we 
implemented the corresponding original matrix-based versions 
(M-OMPR and M-SP), original operator-based versions (O- 
OMPR and O-SP), and proposed CG-based (CG-OMPR and 
CG-SP). It should be noted that although both SP and OMPR 
work well for increasingly adding a index to the support set 
like OMP, they are not accelerated by Cholesky factorization. 

A. Simulation Setting 

The simulations were conducted in an Matlab R2012b 
environment with an Intel CPU Q6600 and 4 GB RAM under 
Microsoft Win? (64 bits). 

The model for the measurement vector in CS is y = As-l- 77 , 
where rj is an addictive Gaussian noise with standard deviation 




arf. The input signal s was produced via a Gaussian models 
as: 

s^pN{0,al„), (13) 

which was also adopted in jSS). In Eq. (fOl l. p is the probability 
of the activity of a signal and controls the number of non¬ 
zero entries of x. Sparsity K is defined to be X = pN. 
(Ton is Standard deviation for input signal. $ is designed from 
SRM and is chosen to be a discrete cosine transform. In 
the following experiments, M = ^, K = ^, p = 0.0625, 
Ton = 1, and Trj = 0.01. 

B. Memory Cost Comparison 

Fig.m shows the comparison in terms of memory cost vs. 
signal length N. Since the matrix-based algorithms (M-OMP, 
M-OMPR, and M-SP) run out of memory, their results are not 
shown in Fig. |2] (note that the result regarding matrix-based 
OMP can be found in Fig. [T]). First, we can observe from 
Fig. |2] that O-OMP, O-OMPR, and O-SP still require about 
O(iV^) and fail to work when N > 2^®. Second, in contrast 
with O-OMP, O-OMPR, and O-SP although CG-OMP, CG- 
OMPR, and CG-SP need more memory costs than the ideal 
cost, which is 0{N), their slopes are nearly identical, which 
seems to imply that the CG-based versions incur larger Big-O 
constants. Thus, the proposed idea of fast and cost-effective 
least square solver is readily incorporated with the existing 
greedy algorithms to improve their capability of handling 
large-scale signals. 



Fig. 2. Memory cost vs. Signal length, where M = -^ and P = ^ 
{p = 0.0625). 


C. Computation Cost Comparison 

Before illustrating the computation cost comparison, we 
hrst discuss the convergence condition of CG in Algorithm 
[Has follows: 1) For exact precision, ||rj ||2 = 0 is set as the 
stopping criterion. As described in Theorem|5] it costs at most 
K iterations to converge. 2) With inexact or finite precision, 
we set ||rj ||2 < f with ^ > 0. Though the precision is finite, 
the signals, in practice, are interfered with noises and solutions 
with hnite precision are adequate. Under the circumstance, the 
number of iterations required to converge will be significantly 
decreased. 


Figs. lUa), (b), and (c) show the computation cost vs. 
signal length for OMP, OMPR, and SP, respectively, under 
the condition that the precision of CG was set to be exact. 
In other words, we hx the same reconstruction quality for all 
comparisons in Fig. |2 and discuss the computation costs for 
these different versions of algorithms. It should be noted that 
some curves are cut because of running out of memory. 

From Fig. |3| it is observed that the operator-based strategy 
(denoted with dash curves) or our proposed CG-based method 
(denoted with solid curves) can effectively reduce the order of 
computation cost in comparison with the matrix-based strategy 
(denoted with solid-star curves). More specifically, in Fig. 
Ea), it is noted that M-OMP is only fast than CG-OMP with 
N < 2^^ due to smaller Big-O constant. However, CG-OMP 
outperforms M-OMP in the end since the order of computation 
complexity of CG-OMP is lower than that of M-OMP. In 
particular, such improvements are significant for large-scale 
signals (with large N). Moreover, in Figs. Eb) and (c), O- 
OMPR and O-SP have the same orders with M-OMPR and 
M-SP because no Cholesky factorization is used. 

On the other hand, when hnite precision is considered, we 
consider two cases of setting ^ = 10“® and ^ = 10“^° to 
verify that hnite precision is adequate under the condition of 
noisy interferences. Taking exact precision as the baseline, the 
difference of SNR values for reconstruction between settings 
for exact precision and ^ = 10“® is about ±0.1 dB. Similarly, 
the difference between exact precision and ^ = 10“^° is 
about ±0.01 dB. Tables |II] [III] and HVl further illustrate the 
comparisons of computation costs under different precisions. 
The precision setting ^ = 10“^° results in about four times 
faster than the exact precision but only sacrihces ±0.1 dB 
for the reconstruct quality, which is acceptable for many 
applications. In fact, the performance occasionally is better 
because exact precision may lead to over-htting. 

V. Conclusions 

The bottleneck of greedy algorithms in the context of 
compressive sensing is to solve the least square problem, in 
particular, when facing large-scale data. In this paper, we 
address this challenging issue and propose a fast but cost- 
effective least square solver. Our solution has been theo¬ 
retically proved and can be readily incorporated with the 
existing greedy algorithms to improve their performance by 
significant! reducing computation complexit and memory cost. 
Case studies on combining our method and OMP, SP, and 
OMPR have been conducted and shown promising results. 
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